Topic Modeling: Turning Conversations into Strategy

R Ladies Abuja

Ifeoma Egbogah

The Rise of Unstructured Data

From Numbers to Words

If you’re in the field of analytics or data science, you’re likely well aware that data is being produced continuously—and at an increasingly rapid pace. (You might even be tired of hearing this repeated!) While analysts are typically trained to work with structured, numeric data in table formats, a significant portion of today’s data boom involves unstructured, text-based information.

Unstructured data represents 80-90% of all new enterprise data, according to Gartner.

Furthermore, it’s growing three times faster than structured data.

Behind the Words

Overview

In this webinar, we will look at:

  • Understanding Topic Modeling
  • Importance of Topic Modeling
  • Topic Modeling Techniques
  • Implementation of Topic Modeling
  • Demo

Understanding Topic Modeling

What is Topic Modeling?

Topic modeling is a way to identify themes or semantic patterns in a corpus (a collection of documents).

Topic modeling finds the relationships between words in the text, thereby identifying clusters of words that represent topics.

It is like amplified reading, a way to discover themes you may not see yourself.

Glossary:

Corpus: Group of documents

Documents: Newspapers, Blog posts, Tweets, Articles, Journals, Customer reviews, etc.

Importance of Topic Modeling

Key Importance of Topic Modeling

Uncovering Hidden Themes:

Topic modeling helps discover latent themes and patterns within unstructured text data that might otherwise be missed, providing a deeper understanding of large datasets.

Efficient Information Retrieval and Organization:

It automatically organizes and groups documents by their main themes, making it easier to find relevant information and creating a manageable structure for large text collections.

Supporting Data-Driven Decisions:

By identifying prevalent topics in customer reviews, social media, or research, organizations can make more informed decisions to improve products, services, and strategies.

Key Importance of Topic Modeling Contd.

Automating Text Analysis:

It automates the time-consuming process of manually reading and categorizing large volumes of text, increasing efficiency and reducing human effort.

Enhancing Research and Discovery:

In academia, topic modeling helps analyze research publications to reveal trends and key topics, thereby streamlining the research process and potentially leading to new discoveries.

Improving Customer Experience:

Businesses can use topic modeling to analyze customer service emails or feedback to understand major challenges and concerns, allowing for targeted improvements to service delivery.

Topic Modeling Techniques

Latent Dirichlet Allocation (LDA)

Source: Introduction to Probabilistic Topic Models paper by Blei et al.

Latent Dirichlet Allocation (LDA) is one of the most common algorithms for topic modeling. It is guided by two principles, that:

  • Every document is a mixture of topics
  • Every topic is a mixture of words
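These two principles can be illustrated with a minimal base-R sketch (illustrative only; real LDA estimates these mixtures from the data, and the topic count, vocabulary, and shape parameter below are arbitrary assumptions):

```r
set.seed(42)

# Principle 1: every document is a mixture of topics.
# Draw topic proportions for one document (a Dirichlet sample
# via normalized gamma draws).
doc_topic_mix <- rgamma(3, shape = 0.5)
doc_topic_mix <- doc_topic_mix / sum(doc_topic_mix)  # sums to 1 over 3 topics

# Principle 2: every topic is a mixture of words.
# Draw word proportions for one topic over a toy vocabulary.
vocab <- c("economy", "match", "song", "election", "phone")
topic_word_mix <- rgamma(length(vocab), shape = 0.5)
topic_word_mix <- topic_word_mix / sum(topic_word_mix)
names(topic_word_mix) <- vocab
```

LDA's job is to recover mixtures like these from observed word counts alone.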

Implementation of Topic Modeling

Step 1

Data Preparation

Collect the text data

Examples:

  • Taylor Swift’s lyrics (there is a taylor package in R)

  • Spice Girls lyrics

  • BBC News

  • gutenbergr package

  • quanteda package

Step 2

Preprocessing

Before modeling, we preprocess the data to put it in a tidy format by:

  • Tokenizing (splitting sentences into words)

  • Removing punctuation and numbers

  • Removing stop words (like the, and, is)

  • Finding document-word counts
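A sketch of these steps in tidy form, using a made-up two-document corpus (the docs data frame and its values are hypothetical):

```r
library(dplyr)
library(tidytext)

# Hypothetical mini-corpus: one row per document
docs <- tibble(
  id   = c(1, 2),
  text = c("The economy grew 3% this year!", "The band released a new song.")
)

word_counts <- docs |>
  unnest_tokens(word, text) |>             # tokenize; lowercases and drops punctuation
  filter(!grepl("[0-9]", word)) |>         # remove tokens containing numbers
  anti_join(stop_words, by = "word") |>    # remove stop words (the, and, is, ...)
  count(id, word, sort = TRUE)             # document-word counts
```

Note that unnest_tokens() already handles lowercasing and punctuation, so the extra steps only need to deal with numbers and stop words.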

Step 3

Create Document-term Matrix

A matrix that represents the frequency of each word (term) across all documents.

We can cast a one-token-per-row table into a sparse document-term matrix with tidytext’s cast_sparse(), or into a tm DocumentTermMatrix with cast_dtm().

- Rows = documents;

- Columns = terms/words.
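For example, starting from made-up document-word counts (the values below are hypothetical):

```r
library(dplyr)
library(tidytext)

# Hypothetical one-token-per-row counts
word_counts <- tibble(
  document = c("d1", "d1", "d2", "d2"),
  word     = c("economy", "trade", "song", "music"),
  n        = c(3L, 1L, 2L, 2L)
)

# Sparse matrix: rows = documents, columns = terms
sparse_m <- cast_sparse(word_counts, document, word, n)
dim(sparse_m)

# For packages expecting tm's DocumentTermMatrix, cast_dtm() takes
# the same arguments
dtm <- cast_dtm(word_counts, document, word, n)
```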

Step 4

Model Fitting

We can then use the LDA() function from the topicmodels package to create a topic model.
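A minimal fit on toy data might look like this (the counts and k = 2 are arbitrary; setting a seed makes the result reproducible):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy document-word counts standing in for real data
word_counts <- tibble(
  document = rep(c("d1", "d2", "d3"), each = 2),
  word     = c("economy", "trade", "song", "music", "match", "goal"),
  n        = c(4L, 2L, 3L, 2L, 5L, 1L)
)

dtm <- cast_dtm(word_counts, document, word, n)

# k is the number of topics, chosen ahead of time
lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))
```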

Step 5

Interpret and Visualise the Result

  • Extract top keywords per topic.

  • Label the topics manually (e.g., “Customer Service Issues” or “Product Features”).

  • Visualize using tools like: ggplot2 package

Step 6

Apply Result

  • Summarize the results

  • Identify customer pain points

  • Track emerging trends

Packages


We will make use of the following packages:

  • tidyverse

  • tidytext

  • topicmodels

  • tm

Demo

BBC News: Ever wondered what the news was really talking about beneath the headlines?

We’ll be working with the BBC News dataset, a collection of 2225 news articles published between 2004 and 2005, covering five major categories: Business, Entertainment, Politics, Sport, and Technology.

The goal of this project is to combine all the articles into one dataset and apply unsupervised topic modeling to uncover the hidden, underlying themes within the news stories. By analyzing the text, we’ll reveal the most prominent topics that dominated the public narrative during that time — all without manually assigning any labels.

About the Dataset

The BBC News dataset contains three key columns:

  • Title – the headline of each news article

  • Description – a brief summary or excerpt from the article

  • Category – the labeled topic (e.g., Business, Sport, etc.)

For this project, we focus on the Description column as the primary source of text. This is where we perform all of our text cleaning and preprocessing — removing punctuation, converting to lowercase, tokenizing, eliminating stop words, and so on.

Once the topic model is built, we use the Category column as a benchmark to evaluate how well our model’s discovered topics align with the actual labeled categories. This gives us a way to check the quality and accuracy of our model.

glimpse(bbc)
Rows: 2,225
Columns: 4
$ ...1        <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ Title       <chr> "India calls for fair trade rules", "Sluggish economy hits…
$ Description <chr> "india attend g7 meet seven lead industrialis nation frida…
$ Category    <chr> "Business", "Business", "Business", "Business", "Business"…

Data Cleaning

We start by using the clean_names() function from the janitor package to standardize all column names (e.g., convert to lowercase, replace spaces with underscores).

Then we remove duplicate rows based on the description column using distinct(description, .keep_all = TRUE), so each news article description appears only once.

The data had 141 duplicate rows.

bbc <- bbc |> 
  janitor::clean_names() |> 
  distinct(description, .keep_all = TRUE)

Preprocessing

Let’s tokenize the description column using unnest_tokens() to break each news article down into individual words.

Then use count(word, sort = TRUE) to calculate the frequency of each word across the dataset and sort them in descending order.
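The pipeline just described might look like this (a sketch that assumes the cleaned bbc data frame from the previous step):

```r
library(dplyr)
library(tidytext)

bbc |>
  unnest_tokens(word, description) |>  # one row per word
  count(word, sort = TRUE) |>          # word frequencies, descending
  head(5)
```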

# A tibble: 5 × 2
  word      n
  <chr> <int>
1 mr     2799
2 year   2655
3 would  2401
4 also   1997
5 peopl  1855

Preprocessing contd.

How about title per word?

bbc |> 
  unnest_tokens(word, description) |> 
  count(title, word, sort = TRUE)|> 
  head(5)
# A tibble: 5 × 3
  title                            word      n
  <chr>                            <chr> <int>
1 Scissor Sisters triumph at Brits song     81
2 Brits debate over 'urban' music  music    72
3 Losing yourself in online gaming game     66
4 Kilroy launches 'Veritas' party  parti    60
5 Minimum wage increased to £5.05  wage     60

Train a model

To train a topic model using LDA() from the topicmodels package, we need to create a sparse matrix from our tidy dataframe of tokens using cast_sparse(title, word, n)

[1] 1967 3458

This means there are 1967 titles (i.e. documents) and 3458 distinct tokens (i.e. terms or words) in our dataset for modeling.

Train a Model

A topic model like this one models:

  • each document as a mixture of topics
  • each topic as a mixture of words

The most important parameter when training a topic model is K, the number of topics. This is like k in k-means: it is a hyperparameter of the model, and we must choose its value ahead of time.
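One rough way to guide the choice of K is to compare models fitted at several candidate values, for example by perplexity (a sketch; it assumes the document-term matrix dtm built earlier, and the candidate values are arbitrary):

```r
library(topicmodels)

# Fit one model per candidate k and compare perplexity.
# Lower perplexity suggests a better statistical fit, but weigh it
# against how interpretable the resulting topics are.
ks <- c(2, 5, 10)
fits <- lapply(ks, function(k) LDA(dtm, k = k, control = list(seed = 1234)))
data.frame(k = ks, perplexity = sapply(fits, perplexity))
```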

Explore Topic Model Result

Beta matrix

To dig deeper into our topic model, we can use the tidy() function to convert the results into a dataframe that we can work with. This gives us two types of outputs:

  • Beta matrix: shows the probability of each word belonging to each topic (topic-word distribution)

  • Gamma matrix: shows how much each topic contributes to each document (document-topic distribution)
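Both matrices come out of the same tidy() call, switched by the matrix argument (a sketch; lda_model is assumed to be the fitted model from the previous step):

```r
library(tidytext)

beta_tidy  <- tidy(lda_model, matrix = "beta")   # one row per topic-term pair
gamma_tidy <- tidy(lda_model, matrix = "gamma")  # one row per document-topic pair
```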

Beta Matrix Contd.

We’ll start by looking at the beta matrix first.

# A tibble: 17,290 × 3
   topic term        beta
   <int> <chr>      <dbl>
 1     1 song  0.00000532
 2     2 song  0.00736   
 3     3 song  0.00000587
 4     4 song  0.00000655
 5     5 song  0.00000511
 6     1 music 0.00000532
 7     2 music 0.0304    
 8     3 music 0.00000587
 9     4 music 0.00000655
10     5 music 0.00000511
# ℹ 17,280 more rows

Beta Matrix Visualisation

Since the output is a tidy dataframe, we can easily manipulate it — including visualizing the top words with the highest probabilities for each topic.
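For example, a faceted bar chart of the top terms per topic might be sketched like this (assumes the fitted model lda_model; showing 10 terms per topic is an arbitrary choice):

```r
library(dplyr)
library(ggplot2)
library(tidytext)

tidy(lda_model, matrix = "beta") |>
  group_by(topic) |>
  slice_max(beta, n = 10) |>                         # top 10 terms per topic
  ungroup() |>
  mutate(term = reorder_within(term, beta, topic)) |> # order terms within facets
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered()
```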

Explore Topic Model Result

Gamma matrix

The gamma matrix gives the probability that a given document is about a given topic.

# A tibble: 9,835 × 3
   document                          topic  gamma
   <chr>                             <int>  <dbl>
 1 Scissor Sisters triumph at Brits      1 0.0127
 2 Brits debate over 'urban' music       1 0.0164
 3 Losing yourself in online gaming      1 0.0207
 4 Kilroy launches 'Veritas' party       1 0.701 
 5 Minimum wage increased to £5.05       1 0.0951
 6 Nadal puts Spain 2-0 up               1 0.0229
 7 Mobile games come of age              1 0.0190
 8 Apple laptop is 'greatest gadget'     1 0.0115
 9 Peer-to-peer nets 'here to stay'      1 0.0479
10 Terror powers expose 'tyranny'        1 0.295 
# ℹ 9,825 more rows

Explore Topic Model Result Contd

Gamma matrix

The remaining plots explore the gamma matrix: the most common topic in each document, and the probability of a document belonging to each topic.
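The per-document assignment behind these views can be sketched as follows (assumes the fitted model lda_model and the cleaned bbc data; joining on title reflects how the documents were named when casting the sparse matrix):

```r
library(dplyr)
library(tidytext)

# Most likely topic per document: keep the row with the highest gamma
doc_topics <- tidy(lda_model, matrix = "gamma") |>
  group_by(document) |>
  slice_max(gamma, n = 1) |>
  ungroup()

# Benchmark the discovered topics against the labeled categories
doc_topics |>
  left_join(bbc, by = c("document" = "title")) |>
  count(topic, category)
```

If the model has found the underlying structure, each topic should line up predominantly with one of the five labeled categories.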